
feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3#787

Open
vikrantpuppala wants to merge 9 commits into `main` from `feat/kernel-backend`

Conversation

Contributor

@vikrantpuppala vikrantpuppala commented May 14, 2026

Summary

Phase 2 of the PySQL × kernel integration plan (design doc). Adds a new opt-in use_kernel=True connection flag that routes through a new backend/kernel/ module delegating to the Rust kernel via the databricks_sql_kernel PyO3 extension (kernel PR #13).

This replaces the previous ADBC POC branches (backend/adbc/ and backend/adbc_dm/ on adbc-rust-backend-via-dm, which were never merged) with a clean port that uses the kernel's v0 Databricks-native API directly instead of layering through ADBC.

Flag semantics

| Flag | Routes to | Status |
| --- | --- | --- |
| `use_kernel=True` | new `KernelDatabricksClient` (Rust kernel via PyO3) | new, opt-in, in active development |
| `use_sea=True` | existing `SeaDatabricksClient` (pure-Python SEA) | unchanged |
| neither | `ThriftDatabricksClient` (Thrift) | unchanged |

use_kernel=True and use_sea=True are mutually exclusive — passing both raises ValueError. Existing use_sea=True callers are unaffected.
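The routing and mutual-exclusion check described above can be sketched as follows. This is a hypothetical helper for illustration (`pick_backend` is not a real connector function; the actual dispatch lives in `Session._create_backend` and instantiates client classes rather than returning names):

```python
def pick_backend(use_kernel: bool = False, use_sea: bool = False) -> str:
    """Illustrative sketch of the flag routing; returns the client class name."""
    if use_kernel and use_sea:
        # the two opt-in flags are mutually exclusive
        raise ValueError("use_kernel=True and use_sea=True are mutually exclusive")
    if use_kernel:
        return "KernelDatabricksClient"  # Rust kernel via PyO3
    if use_sea:
        return "SeaDatabricksClient"     # pure-Python SEA, unchanged
    return "ThriftDatabricksClient"      # default, unchanged
```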

What use_kernel=True does

```python
import databricks.sql

# PAT auth — works end-to-end today
with databricks.sql.connect(
    server_hostname=...,
    http_path=...,
    access_token="dapi...",
    use_kernel=True,
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM main.default.t LIMIT 100")
        rows = cur.fetchall()  # rows come from the kernel via PyO3
```

New module layout

| File | Purpose |
| --- | --- |
| `src/databricks/sql/backend/kernel/client.py` | `KernelDatabricksClient(DatabricksClient)`. Lazy-imports `databricks_sql_kernel` so a connector install without the kernel wheel doesn't fail at startup — only `use_kernel=True` surfaces the missing-extra `ImportError`. |
| `src/databricks/sql/backend/kernel/auth_bridge.py` | Translates the connector's `AuthProvider` to kernel `Session` auth kwargs. PAT (including `TokenFederationProvider`-wrapped PAT — every provider is wrapped, so a naive `isinstance(AccessTokenAuthProvider)` check has to look through the wrapper) routes through `auth_type='pat'`. Anything else raises `NotSupportedError` until the kernel exposes a full external-auth surface. |
| `src/databricks/sql/backend/kernel/result_set.py` | `KernelResultSet(ResultSet)`. Duck-typed over the kernel's `ExecutedStatement` (sync exec) and `ResultStream` (metadata + async `await_result`); both expose `arrow_schema` / `fetch_next_batch` / `close`. FIFO batch buffer for `fetchmany(n)` semantics, with O(1) buffered-row accounting via a running counter. |
| `src/databricks/sql/backend/kernel/type_mapping.py` | Arrow → PEP 249 description-string mapper. |
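The Arrow → PEP 249 mapping in `type_mapping.py` can be sketched like this. The table entries and helper name here are illustrative assumptions (keyed on Arrow type names as plain strings so the sketch runs without pyarrow); the real mapper operates on `pyarrow.DataType` objects:

```python
# Illustrative subset of an Arrow-type-name → DBAPI type-slug table.
_ARROW_TO_DBAPI = {
    "int32": "int",
    "int64": "bigint",
    "utf8": "string",
    "float64": "double",
    "bool": "boolean",
    "date32": "date",
}

def describe(column_name: str, arrow_type_name: str) -> tuple:
    """Build a PEP 249 description tuple:
    (name, type_code, display_size, internal_size, precision, scale, null_ok).
    Unknown Arrow types fall back to their string representation."""
    type_code = _ARROW_TO_DBAPI.get(arrow_type_name, str(arrow_type_name))
    return (column_name, type_code, None, None, None, None, None)
```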

Error mapping

KernelError.code → PEP 249 exception class, in a single table in client.py. Structured fields (sql_state, error_code, query_id, http_status, retryable, vendor_code) are copied onto the re-raised exception so callers can branch on err.code / err.sql_state directly. Live e2e verified: bad SQL on use_kernel=True surfaces as DatabaseError(code='SqlError', sql_state='42P01').
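The single-table mapping plus structured-field copying described above can be sketched as follows. Exception classes and table entries here are a minimal stand-in (the real table in `client.py` covers 14 codes and uses the connector's own exception hierarchy):

```python
# Minimal stand-ins for the connector's PEP 249 exception hierarchy.
class DatabaseError(Exception): pass
class OperationalError(DatabaseError): pass
class ProgrammingError(DatabaseError): pass

# Illustrative subset of the KernelError.code → exception-class table.
_CODE_TO_EXCEPTION = {
    "SqlError": DatabaseError,
    "InvalidState": ProgrammingError,
    "Network": OperationalError,
}

def reraise_kernel_error(code, sql_state=None, **fields):
    """Map a kernel error code to a PEP 249 exception and copy the kernel's
    structured fields onto it so callers can branch on err.code / err.sql_state
    without reaching through __cause__."""
    exc = _CODE_TO_EXCEPTION.get(code, DatabaseError)(f"{code}: {sql_state}")
    exc.code = code
    exc.sql_state = sql_state
    for key, value in fields.items():  # e.g. query_id, http_status, retryable
        setattr(exc, key, value)
    raise exc
```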

Packaging

Without the kernel wheel, use_kernel=True raises:

ImportError: use_kernel=True requires the databricks-sql-kernel package.
Install it with: pip install databricks-sql-kernel

Local dev: cd databricks-sql-kernel/pyo3 && maturin develop --release into the connector's venv. (The [kernel] extra is intentionally not declared in pyproject.toml yet — databricks-sql-kernel isn't on PyPI, and declaring an unpublished dep breaks poetry lock for every CI job. The extra will land once the wheel is on PyPI.)

⚠️ Known gaps — acknowledged follow-ups

| Gap | Surface | Why deferred |
| --- | --- | --- |
| Parameter binding | `execute_command(parameters=[...])` raises `NotSupportedError` | PyO3 `Statement.bind_param` lands in a small follow-up PR on the kernel repo |
| Statement-level `query_tags` | Raises `NotSupportedError` | Same follow-up — PyO3 needs to expose `statement_conf` |
| `get_columns(catalog_name=None)` | Raises `ProgrammingError` (kernel's `SHOW COLUMNS` cannot span catalogs) | Documented divergence from Thrift/SEA; called out in the `Cursor.columns()` docstring |
| OAuth / federation / external-auth | `NotSupportedError` from the auth bridge | Kernel-side enablement PR covers this — once it lands, OAuth/MSAL/federation start working through this backend automatically |
| Volume PUT/GET (staging) | Not supported | Kernel has no Volume API yet |
| Telemetry parity | Per-statement `execution_result` / `retry_count` / chunk-level latency under-report on `use_kernel=True` | Needs kernel-side hooks; will plumb once available |
| Connector kwargs (TLS / proxy / retry / pool) | Honored by the kernel's own HTTP stack rather than the connector's; user-supplied values via `ssl_options` / `http_headers` / `http_client` are accepted-and-ignored | Kernel manages its own HTTP stack today; will plumb a per-knob bridge as kernel surfaces them |

Code review feedback addressed in this revision

Multi-reviewer (architecture, security, ops, performance, test, maintainability, agent-compat, language, devil's advocate) review surfaced several issues; the highest-impact ones are addressed in commits 37fa5446 (mechanical) and 24e9a5c2 (substantive):

  • Renamed the flag from repurposing use_sea=True to a dedicated use_kernel=True; native SEA routing is unchanged. (Was the largest reviewer concern.)
  • Implemented client-side table_types filter in get_tables using the SEA backend's _filter_arrow_table helper, replacing the previous "log a warning and return unfiltered" behaviour.
  • Replaced O(M²) buffer accounting in KernelResultSet with a running counter (_buffered_count).
  • KernelResultSet.close() now drops the entry from backend._async_handles to avoid stale references.
  • threading.RLock around _async_handles mutations / reads for concurrent-cursor safety.
  • Synthetic metadata CommandIds use plain uuid.uuid4().hex (no metadata- prefix) so cursor.query_id stays parseable downstream.
  • Raw PAT cleared from KernelDatabricksClient._auth_kwargs after the kernel Session is constructed.
  • Bearer-token control-char sanitization in the auth bridge.
  • Removed dead _use_arrow_native_complex_types kwarg; tightened _reraise_kernel_error signature and dropped dead branch; collapsed three KernelResultSet(...) construction sites through one _make_result_set helper.
  • Cursor.columns() docstring now documents the catalog_name=None divergence on use_kernel=True.
  • Federation-wrapped PAT unit test now constructs a real TokenFederationProvider(http_client=Mock()) instead of bypassing __init__.

Test plan

  • Unit tests — 75 cases across the kernel suite:
    • tests/unit/test_kernel_client.py (new in this revision, 38 cases) — covers _CODE_TO_EXCEPTION (14 entries), _reraise_kernel_error attribute forwarding, _STATE_TO_COMMAND_STATE (6 entries), all no-open-session guards, open_session double-open, parameters / query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId shape, close_session cleanup on partial close failures. Uses a fake databricks_sql_kernel module so the test runs with no Rust extension dependency.
    • tests/unit/test_kernel_auth_bridge.py — PAT, federation-wrapped PAT (now via real TokenFederationProvider(http_client=Mock())), non-PAT rejection paths.
    • tests/unit/test_kernel_type_mapping.py — Arrow type mapping per type, description-tuple shape, fallback to str() for unknowns.
    • tests/unit/test_kernel_result_set.py — buffer semantics, fetchmany slicing within batch + across batch boundaries, idempotent close, close() swallowing handle-close failures, empty stream.
  • Full pre-existing unit suite — passes; the one pre-existing failure (test_useragent_header — agent detection adds agent/claude-code in this env, fails on main too) is unrelated to this change.
  • Live e2e against dogfood with use_kernel=True (PAT): SELECT 1, SELECT * FROM range(10000), fetchmany pacing, fetchall_arrow, all four metadata calls (catalogs / schemas / tables / columns), session_configuration={'ANSI_MODE': 'false'} round-trips, bad SQL surfaces as DatabaseError with code='SqlError' and sql_state='42P01'.

This pull request and its description were written by Isaac.

Phase 2 of the PySQL × kernel integration plan
(databricks-sql-kernel/docs/designs/pysql-kernel-integration.md).
Wires `use_sea=True` to a new `backend/kernel/` module that
delegates to the Rust kernel via the `databricks_sql_kernel` PyO3
extension (kernel PR #13).

New module: `src/databricks/sql/backend/kernel/`

- `client.py` — `KernelDatabricksClient(DatabricksClient)`. Lazy-
  imports `databricks_sql_kernel` so a connector install without
  the kernel wheel doesn't `ImportError` at startup; only
  `use_sea=True` surfaces the missing-extra message. Implements
  open/close_session, sync + async execute_command (async_op=True
  goes through `Statement.submit()` and stashes the handle in a
  dict keyed on `CommandId`), cancel/close_command,
  get_query_state, get_execution_result, and the metadata calls
  (catalogs / schemas / tables / columns) via
  `Session.metadata().list_*`. Real server-issued session and
  statement IDs flow through (no synthetic UUIDs).
- `auth_bridge.py` — translate the connector's `AuthProvider`
  into kernel `Session` kwargs. PAT (including federation-wrapped
  PAT — `get_python_sql_connector_auth_provider` always wraps the
  base in `TokenFederationProvider`, so a naive isinstance check
  never matches) routes through `auth_type="pat"`. Everything
  else routes through `auth_type="external"` with a callback that
  delegates to `auth_provider.add_headers({})`. (External today
  is rejected by the kernel at `build_auth_provider`; the
  separate kernel-side enablement PR will flip it on.)
- `result_set.py` — `KernelResultSet(ResultSet)`. Duck-typed over
  `databricks_sql_kernel.ExecutedStatement` (sync execute) and
  `ResultStream` (metadata + async await_result) since both
  expose `arrow_schema()` / `fetch_next_batch()` /
  `fetch_all_arrow()` / `close()`. Same FIFO batch buffer the
  prior ADBC POC used, so `fetchmany(n)` for n smaller than the
  kernel's natural batch size doesn't re-fetch.
- `type_mapping.py` — Arrow → PEP 249 description-string mapper.
  Lifted from the prior ADBC POC; centralised here so future
  kernel-result wrappers reuse the same mapping.

Kernel errors → PEP 249 exceptions: `KernelError.code` is mapped
in a single table to `ProgrammingError` / `OperationalError` /
`DatabaseError`. The structured fields (`sql_state`,
`error_code`, `query_id`, …) are copied onto the re-raised
exception so callers can branch on them without reaching through
`__cause__`.

Routing: `Session._create_backend` flips the `use_sea=True` branch
to instantiate `KernelDatabricksClient` instead of the native
`SeaDatabricksClient`. The native `backend/sea/` module is left
in place (no users on `use_sea=True` after this PR; its long-
term fate is out of scope here).

Packaging: `[tool.poetry.extras] kernel = ["databricks-sql-kernel"]`.
`pip install 'databricks-sql-connector[kernel]'` pulls in the
kernel wheel; `use_sea=True` without the extra raises a pointed
ImportError telling the user how to install it.

Known gaps (acknowledged, will be follow-ups):
- Parameter binding (`execute_command(parameters=[...])`) raises
  NotSupportedError — PyO3 `Statement.bind_param` lands in a
  follow-up.
- Statement-level `query_tags` raises NotSupportedError.
- `get_tables(table_types=[...])` returns unfiltered rows (the
  native SEA backend's filter is keyed on `SeaResultSet`; needs
  a small port to operate on `KernelResultSet`).
- External-auth end-to-end blocked on the kernel-side
  `AuthConfig::External` enablement PR.
- Volume PUT/GET (staging operations): kernel has no Volume API.

Test plan:
- Unit: 37 new tests across
  `tests/unit/test_kernel_auth_bridge.py` (auth provider →
  kwargs mapping, including federation-wrapped PAT and the
  External trampoline call-counter check),
  `tests/unit/test_kernel_type_mapping.py` (Arrow type mapping
  + description shape), and
  `tests/unit/test_kernel_result_set.py` (buffer semantics,
  fetchmany across batch boundaries, idempotent close, close()
  swallowing handle-close failures). All pass.
- Full unit suite: 600 pre-existing tests still pass; one
  pre-existing failure (`test_useragent_header` — agent
  detection adds `agent/claude-code` in this env) was already
  failing on main, unrelated to this change.
- Live e2e against dogfood with `use_sea=True`: SELECT 1,
  `range(10000)`, `fetchmany` pacing, `fetchall_arrow`, all four
  metadata calls (returned 75 catalogs / 144 schemas in main /
  47 tables in `system.information_schema` / 15 columns),
  `session_configuration={'ANSI_MODE': 'false'}` round-trips,
  bad SQL surfaces as DatabaseError with `code='SqlError'`
  and `sql_state='42P01'` on the exception. All checks pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The earlier auth_bridge routed OAuth/MSAL/federation through the
kernel's External token-provider trampoline (a Python callable
the kernel invoked per HTTP request). Removing that for now.

Why: routing OAuth into the kernel inherently requires per-request
token resolution to keep refresh working during a long-running
session. Two viable mechanisms (kernel-native OAuth, or the
External callback); both have costs (duplicate OAuth flows vs
GIL-per-request). Punting the decision until there's actual
demand on use_sea=True.

Today: the bridge accepts PAT (including TokenFederationProvider-
wrapped PAT, which is how `get_python_sql_connector_auth_provider`
always shapes it). Any non-PAT auth_provider raises a clear
NotSupportedError pointing the user at use_sea=False (Thrift).

This shrinks the auth_bridge to ~50 lines and means the kernel-
side External enablement PR is no longer on the connector's
critical path — there's no kernel-side prerequisite for shipping
use_sea=True for PAT users.
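The wrapper-aware PAT check described above can be sketched as follows. Class names mirror the PR text, but the `inner` attribute and helper name are illustrative assumptions, and `NotImplementedError` stands in for the connector's `NotSupportedError`:

```python
class AccessTokenAuthProvider:
    """Stand-in for the connector's PAT provider."""
    def __init__(self, token: str):
        self.token = token

class TokenFederationProvider:
    """Stand-in for the wrapper every provider is shaped into."""
    def __init__(self, inner):
        self.inner = inner

def to_kernel_auth_kwargs(provider) -> dict:
    # Look through the federation wrapper first — a naive
    # isinstance(provider, AccessTokenAuthProvider) check never matches.
    inner = getattr(provider, "inner", provider)
    if isinstance(inner, AccessTokenAuthProvider):
        return {"auth_type": "pat", "token": inner.token}
    raise NotImplementedError(
        "Only PAT auth is supported on the kernel backend today"
    )
```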

Unit tests updated:
- TokenFederationProvider-wrapped PAT still routes to PAT (kept).
- Generic OAuth provider raises NotSupportedError (new).
- ExternalAuthProvider raises NotSupportedError (new).
- Silent non-PAT provider raises NotSupportedError (new) —
  reject the type itself rather than trying to extract a token
  we already know we can't use.

Live e2e against dogfood with use_sea=True (PAT): all checks
still pass (SELECT 1, range(10000), fetchmany pacing, four
metadata calls, session_configuration round-trip, structured
DatabaseError on bad SQL).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Moves the previously-ad-hoc /tmp/connector_smoke.py into the repo
as a real pytest module under tests/e2e/ — same convention as the
rest of the e2e suite. Uses the existing session-scoped
`connection_details` fixture from the top-level conftest so it
shares the credential surface with every other live test.

11 tests cover:
- connect() with use_sea=True opens a session.
- SELECT 1: rows + description shape (column name + dbapi type slug).
- SELECT * FROM range(10000): multi-batch drain.
- fetchmany() pacing across the buffer boundary.
- fetchall_arrow() returns a pyarrow Table.
- All four metadata methods (catalogs / schemas / tables / columns).
- session_configuration={'ANSI_MODE': 'false'} round-trips.
- Bad SQL surfaces as DatabaseError with `code='SqlError'` and
  `sql_state='42P01'` attached as exception attributes.

Module-level skips:
- `databricks_sql_kernel` not importable → whole module skipped via
  pytest.importorskip (the wheel hasn't been installed).
- Live creds missing → fixture-level skip with a pointed message.

Run: `pytest tests/e2e/test_kernel_backend.py -v`. All 11 pass
against dogfood in ~20s.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Contributor Author

Two updates since the initial PR:

1. Dropped External auth → PAT-only on the kernel backend (25723627). auth_bridge.py now routes PAT (including TokenFederationProvider-wrapped PAT) through the kernel's PAT path; everything else raises NotSupportedError pointing the user at use_sea=False. The kernel-side PR is no longer on this PR's critical path. Why: routing OAuth into the kernel requires per-request token resolution to keep refresh working — two viable mechanisms (kernel-native OAuth, or the External callback), both with costs. Punting the decision until there's pressure.

2. Live e2e tests moved into the repo (6b308156). The previous ad-hoc /tmp/connector_smoke.py is now tests/e2e/test_kernel_backend.py — a proper pytest module using the existing connection_details fixture. 11 tests cover: connect, SELECT 1, range(10000), fetchmany pacing, fetchall_arrow, all four metadata methods, session_configuration round-trip, structured DatabaseError on bad SQL. All 11 pass against dogfood in ~20s. Module skips cleanly when databricks_sql_kernel isn't installed or creds aren't set.

The auth_bridge unit tests are updated: OAuth providers / ExternalAuthProvider now assert NotSupportedError. 39 unit tests pass.

CI is failing across all jobs at `poetry lock` time:

    Because databricks-sql-connector depends on databricks-sql-kernel
    (^0.1.0) which doesn't match any versions, version solving failed.

The kernel wheel isn't yet published to PyPI — we verified the name
is available via the Databricks proxy, but the package itself hasn't
been built and uploaded yet. Declaring it as a poetry dep (even an
optional one inside an extra) requires the version to be resolvable,
and `poetry lock` runs as the setup step for every CI job: unit
tests, linting, type checks, all of them.

Fix: drop the `databricks-sql-kernel` dep declaration and the
`[kernel]` extra from pyproject.toml until the wheel is on PyPI.
The lazy import in `backend/kernel/client.py` still raises a
clear ImportError pointing at `pip install databricks-sql-kernel`
(or local maturin) when use_sea=True is invoked without the kernel
present.

When the kernel is published, a small follow-up will add back:

    databricks-sql-kernel = {version = "^0.1.0", optional = true}
    [tool.poetry.extras]
    kernel = ["databricks-sql-kernel"]

A pointed comment in pyproject.toml documents the deferred change.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Three CI failures after the poetry-lock fix uncovered three real
issues:

1. pyarrow is optional in the connector. The default-deps CI test
   job installs without it; the +PyArrow job installs with. The
   kernel backend's result_set.py + type_mapping.py import pyarrow
   eagerly (the kernel always returns pyarrow), and the unit tests
   import the backend at collection time — which crashes the
   default-deps job at ModuleNotFoundError.

   Fix: gate the three kernel unit tests on `pytest.importorskip(
   "pyarrow")` so they skip on default-deps and run on +PyArrow.
   Verified locally: 39 pass with pyarrow, 3 skipped without.
   No change to the backend module itself — nothing imports it
   until use_sea=True is invoked, and pyarrow is on the kernel
   wheel's runtime dep list so use_sea=True can't hit this either.

2. mypy: KernelDatabricksClient.open_session returns
   self._session_id, which mypy types as Optional[SessionId]
   because the field starts as None. Fix: bind the new id to a
   local non-Optional variable, assign to the field, return the
   local. CI's check-types runs cleanly on backend/kernel/ now;
   pre-existing mypy noise elsewhere isn't mine.

3. black --check: black 22.12.0 (the version CI pins) wants
   reformatting on result_set.py / type_mapping.py / client.py.
   Applied. Verified locally with the same black version.

All 39 kernel unit tests + 619 pre-existing unit tests pass.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The +PyArrow CI matrix installs pyarrow but not the
databricks-sql-kernel wheel (the wheel isn't on PyPI yet, and the
[kernel] extra is deferred — see commit 31ca581). The previous
fix gated unit tests on `pytest.importorskip("pyarrow")` but
test_kernel_auth_bridge.py was still pulled into a kernel-wheel
ImportError because:

  src/databricks/sql/backend/kernel/__init__.py
    -> from databricks.sql.backend.kernel.client import KernelDatabricksClient
        -> import databricks_sql_kernel  # ImportError on +PyArrow CI

The eager re-export from `__init__.py` was a convenience that
broke every consumer that only needed a submodule (type_mapping,
result_set, auth_bridge) — they all triggered the kernel wheel
import for no reason.

Fix:
- Drop the eager re-export from `kernel/__init__.py`. Comment
  documents why and points callers (= session.py::_create_backend,
  already this shape) at the direct `from .client import ...`.
- Drop the no-longer-needed `pytest.importorskip("pyarrow")` /
  `importorskip("databricks_sql_kernel")` from
  test_kernel_auth_bridge.py — auth_bridge.py itself has neither
  dep, so the test now runs on every CI matrix variant.
- test_kernel_result_set.py and test_kernel_type_mapping.py keep
  the pyarrow importorskip because they themselves use pyarrow.

Verified locally across the three matrix shapes:
- both pyarrow + kernel installed: 39 pass.
- pyarrow only (no kernel wheel — the +PyArrow CI shape): 39 pass.
- neither: 9 pass (auth_bridge only), 2 modules skip (the others
  use pyarrow).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

The connector's coverage CI job runs the full e2e suite, several of
whose test classes parametrize ``extra_params`` over ``{}`` and
``{"use_sea": True}``. With ``use_sea=True`` now routing through
the Rust kernel via PyO3, those cases die at ``connect()`` with our
pointed ImportError because the ``databricks-sql-kernel`` wheel
isn't yet on PyPI — and that CI job (sensibly) doesn't try to
build it from a sibling repo.

Fix: ``pytest_collection_modifyitems`` hook in the top-level
``conftest.py`` that adds a ``skip`` marker to any parametrize case
with ``extra_params={"use_sea": True, ...}`` when
``importlib.util.find_spec("databricks_sql_kernel")`` returns
``None``. Behavior change is CI-only — local dev with the kernel
wheel installed (via ``maturin develop`` from the kernel repo)
runs those cases as before.

Once the kernel wheel is published, the [kernel] extra in
pyproject.toml gets enabled (see comment block there) and the
default-deps CI matrix will install it; the skip then becomes a
no-op.
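The skip decision described above can be sketched without the pytest plumbing (`should_skip` is a hypothetical helper; the real hook wraps this logic in `pytest_collection_modifyitems` and applies `pytest.mark.skip` via `item.add_marker`):

```python
import importlib.util

def should_skip(extra_params: dict) -> bool:
    """Skip a parametrize case iff it routes through use_sea=True and the
    kernel wheel is not importable in this environment."""
    if importlib.util.find_spec("databricks_sql_kernel") is not None:
        return False  # kernel wheel installed: run the case live
    return bool(extra_params.get("use_sea"))
```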

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Contributor Author

CI status, final: 33 / 34 checks pass.

The one failing check (test-with-coverage, the e2e + coverage job) reports 7 test failures — but all 7 are pre-existing on main, unrelated to this PR. Verified by reproducing them on an independent recent PR (PR for thrift-result-set-heartbeat, run 25725422361, 2 days ago) and confirming the same 6 TestMstMetadata cases fail there too with identical error messages.

Failure breakdown:

| Test | Failure | Root cause |
| --- | --- | --- |
| `TestMstMetadata::test_cursor_{columns,schemas,tables,catalogs}_in_mst` (4) | `DatabaseError: GetCatalogs/Schemas/Tables/Columns is not supported within a multi-statement transaction` | Server-side change between Apr 28 (last green main run) and today; affects every PR's e2e job |
| `TestMstMetadata::test_cursor_{columns,tables}_non_transactional_after_concurrent_*` (2) | Same server-side MST limitation | Same as above |
| `TestPySQLLargeWideResultSet::test_query_with_large_wide_result_set[True-extra_params0]` (1) | `AttributeError: 'TestPySQLLargeWideResultSet' object has no attribute 'assertEqual'` | Latent bug from #772 ("Optimize CI"): split `LargeQueriesMixin` test classes off `unittest.TestCase`, but `fetch_rows` still calls `test_case.assertEqual`. Triggers only when the fetch completes inside the 5-min budget. |

My PR contributions to this job (all working as intended):

  • 53 tests now skip cleanly (vs failing). All the extra_params={"use_sea": True} cases — see the conftest hook in 8958e76e. Once databricks-sql-kernel ships to PyPI and the [kernel] extra is wired, these run as live tests.
  • 927 tests pass.

I don't believe my PR should be responsible for fixing the 7 pre-existing failures — they need their own fixes (server-side investigation for MST metadata; switching fetch_rows to pytest assertions for the large-result test). Happy to file follow-up issues if you want.

Comment thread src/databricks/sql/session.py Outdated
logger.debug("Creating Thrift backend client")
databricks_client_class = ThriftDatabricksClient
# `use_sea=True` now routes through the Rust kernel via
# PyO3. The native pure-Python SEA backend
Contributor Author


High severity — multi-reviewer consensus (architecture, agent-compat, devils-advocate, ops, test)

use_sea=True previously routed to SeaDatabricksClient (the pure-Python SEA REST backend). After this PR, the same kwarg routes to a Rust PyO3 wheel that isn't on PyPI yet. Side effects on existing use_sea=True users:

  • ImportError on upgrade unless they separately install databricks-sql-kernel
  • OAuth / federation / external auth → NotSupportedError (auth_bridge.py)
  • Parameter binding → NotSupportedError (client.py:476-484)
  • query_tags → NotSupportedError
  • Volume PUT/GET → unsupported
  • Telemetry mis-reports kernel sessions as DatabricksClientType.SEA
  • The native SEA backend (backend/sea/, ~700 LOC) is now zombie code: still in the tree, no longer reachable through any documented entry point.

The module docstring at backend/kernel/__init__.py even says the module's identity is deliberately decoupled from SEA REST (the kernel may switch transport SEA REST → SEA gRPC → …). That contradicts the flag name.

Recommend: introduce use_kernel=True as a new explicit flag. Leave use_sea=True routing to SeaDatabricksClient for now. Deprecate use_sea on a published timeline once the kernel reaches feature parity. Bundles cleanly with the related issues: docstring update at client.py:117 ("Use the SEA backend instead of the Thrift backend") is now factually wrong; CHANGELOG.md has no entry for this behavior change; no version bump; telemetry still reports DatabricksClientType.SEA for kernel sessions.

Contributor Author


Addressed in 24e9a5c — introduced a dedicated use_kernel=True flag. use_sea=True once again routes to the native pure-Python SeaDatabricksClient (unchanged); the new flag is opt-in and mutually exclusive. PR title + description updated to match.


logger.debug("Creating kernel-backed client for use_sea=True")
return KernelDatabricksClient(
server_hostname=server_hostname,
Contributor Author


High severity — flagged by ops, devils-advocate, architecture, maintainability, language

The kernel branch hardcodes 8 kwargs into KernelDatabricksClient(...). The Thrift branch (below) splats **kwargs. Silently dropped on use_sea=True:

  • _socket_timeout
  • All _retry_* knobs (_retry_stop_after_attempts_count, _retry_delay_*, …)
  • _tls_no_verify, _tls_verify_hostname, _tls_trusted_ca_file
  • pool_maxsize
  • use_cloud_fetch, use_hybrid_disposition, enable_query_result_lz4_compression
  • staging_allowed_local_path
  • query_tags
  • User-agent extras
  • _use_arrow_native_complex_types

KernelDatabricksClient.__init__ accepts **kwargs and never references them — accept-and-ignore with zero log line.

The most dangerous case: a user setting _tls_no_verify=True on Thrift (for an on-prem proxy / self-signed cert) gets that honored. On use_sea=True it silently no-ops and the kernel's own TLS stack will verify the cert. The operator believes verification is disabled when it isn't. Same for custom CA bundle (_tls_trusted_ca_file).

Worse: DriverConnectionParameters in telemetry (client.py:396-398) continues reporting socket_timeout=kwargs.get("_socket_timeout", None) — so dashboards lie about what's actually applied.

Recommend: at minimum, log a single WARNING at session-open enumerating which kwargs the kernel backend cannot honor (one-shot per process). Long-term, plumb retry/timeout/proxy/TLS through to the kernel, or refuse to start when unsupported knobs are set.

Contributor Author


Deferred — called out as a known gap in the updated PR description. The kernel manages its own HTTP stack today (TLS, retry, timeout, pool); we'll plumb a per-knob bridge as those surfaces appear kernel-side. Switching to use_kernel=True (24e9a5c) means existing use_sea=True callers — who relied on ssl_options / http_headers / retry knobs being honored on the SEA backend — are unaffected.

logger.warning("Error closing kernel handle: %s", exc)
self._buffer.clear()
self._kernel_handle = None
self._exhausted = True
Contributor Author


High severity — flagged by architecture, ops, performance

The base ResultSet.close() (src/databricks/sql/result_set.py:166-190) calls self.backend.close_command(self.command_id) to free server-side state and to drop client-side tracking. KernelResultSet.close() overrides the base entirely and calls only self._kernel_handle.close().

For the sync execute path this is OK (the executed handle owns the server statement; closing it releases server state). For the async path (execute_command(async_op=True)), the handle is also tracked in KernelDatabricksClient._async_handles. Result:

  1. User calls cursor.execute(..., async_op=True) → handle stored in _async_handles.
  2. User calls cursor.fetchall() → result set built.
  3. User calls cursor.close() → KernelResultSet.close() calls _kernel_handle.close() directly.
  4. _async_handles[command_id.guid] still holds the now-closed handle.
  5. Later, close_session() iterates _async_handles.values() and calls .close() again on the dead handle.

The kernel's close() is idempotent per the PR's docstring, so this isn't a crash — but the bookkeeping is inconsistent and leaks an entry for every async-submitted statement that closes through the result-set path. Long-lived connections accumulate dead entries.

Recommend: KernelResultSet.close() should either (a) call super().close() (which calls backend.close_command), or (b) explicitly self.backend._async_handles.pop(self.command_id.guid, None) after closing the handle.

Contributor Author


Fixed in 24e9a5c. KernelResultSet.close() now pops the entry from backend._async_handles (under the new _async_handles_lock). No-op for sync-execute and metadata paths, which never register there.

self.has_more_rows = False
self.status = CommandState.SUCCEEDED
return False
if batch.num_rows > 0:
Contributor Author


High severity — performance, empirically verified

```python
def _buffered_rows(self) -> int:
    if not self._buffer:
        return 0
    first = self._buffer[0].num_rows - self._buffer_offset
    rest = sum(b.num_rows for b in list(self._buffer)[1:])  # allocates + O(M)
    return first + rest
```

Two issues:

  1. list(self._buffer)[1:] allocates a fresh list on every call — gratuitous. Use itertools.islice(self._buffer, 1, None) or iterate the deque.
  2. _ensure_buffered calls _buffered_rows() in a loop, once per pulled batch → O(M²) in batch count.

Empirically verified with a synthetic harness:

  • 5,000 single-row batches → ~447ms just in the row-count loop (Python-side, no pyarrow).

Hot path: fetchmany(small_N) / fetchone() repeatedly when the kernel returns many small batches, or fetchall() over a deep stream.

Fix (~10 lines): track self._buffered_count: int as a running counter — += batch.num_rows in _pull_one_batch, -= take in _take_buffered, recompute on _drain. _buffered_rows() becomes O(1); _ensure_buffered becomes O(M).
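The running-counter fix can be sketched with plain ints standing in for Arrow batches (a sketch using this thread's names, not the PR's actual code):

```python
from collections import deque

class BufferSketch:
    """Running-count buffer bookkeeping from the recommended fix.
    Each deque entry is just a num_rows int; real code holds Arrow batches."""
    def __init__(self):
        self._buffer = deque()       # num_rows per buffered batch
        self._buffer_offset = 0      # rows already consumed from buffer[0]
        self._buffered_count = 0     # running total, replaces the O(M) recount

    def _pull_one_batch(self, num_rows):
        self._buffer.append(num_rows)
        self._buffered_count += num_rows

    def _take_buffered(self, n):
        """Consume up to n rows from the front of the buffer."""
        take = min(n, self._buffered_count)
        remaining = take
        while remaining:
            avail = self._buffer[0] - self._buffer_offset
            step = min(avail, remaining)
            self._buffer_offset += step
            remaining -= step
            if self._buffer_offset == self._buffer[0]:
                self._buffer.popleft()
                self._buffer_offset = 0
        self._buffered_count -= take
        return take

    def _buffered_rows(self):
        return self._buffered_count  # O(1): no list() copy, no per-batch sum

q = BufferSketch()
for _ in range(5):
    q._pull_one_batch(1)             # five single-row batches
assert q._buffered_rows() == 5
assert q._take_buffered(3) == 3
assert q._buffered_rows() == 2
```

Because _buffered_rows() is now a field read, the _ensure_buffered loop does constant work per pulled batch, i.e. O(M) overall instead of O(M²).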


Fixed in 37fa544. Replaced _buffered_rows with a running counter _buffered_count maintained by _pull_one_batch / _take_buffered / _drain. _buffered_rows is now O(1); _ensure_buffered is O(M) in batch count instead of O(M²).

Comment thread conftest.py

@pytest.fixture(scope="session")
def host():
return os.getenv("DATABRICKS_SERVER_HOSTNAME")

High severity — flagged by test, ops

The new pytest_collection_modifyitems skips every extra_params={"use_sea": True} case when the kernel wheel isn't importable. CI installs --all-extras (.github/workflows/code-coverage.yml:39), but pyproject explicitly does NOT declare a [kernel] extra (per the PR's own comment at pyproject.toml:55-68). Result: every one of those cases is reported SKIPPED on every CI run.

Verified via grep against tests/e2e/test_driver.py — there are ~14 distinct @pytest.mark.parametrize("extra_params", [{}, {"use_sea": True}]) functions (test_query_with_large_wide_result_set, test_long_running_query, test_execute_async__*, test_unicode, test_fetchone, test_fetchall, test_fetchmany_*, test_iterator_api, test_multi_timestamps_arrow, …) plus 4 SEA-parametrized retry tests in tests/e2e/common/retry_test_mixins.py.

Before this PR they ran against SeaDatabricksClient. After this PR they vanish from the CI matrix entirely. This is a silent capability loss across the whole e2e suite, not just within the new code. Combined with F9 (no unit tests for the 511-LOC client.py), coverage of the kernel backend in CI may also fail the 85% threshold gate at .github/workflows/code-coverage.yml:80-81.

Recommend: (a) install the kernel wheel in CI before running e2e, OR (b) document this regression in the PR description so reviewers know use_sea=True e2e coverage in CI is currently 0, AND (c) file a follow-up to land kernel-wheel CI coverage.


Addressed in 24e9a5c — removed the conftest collection hook entirely. With use_sea=True back on the native SEA backend, the existing extra_params=[{}, {"use_sea": True}] parametrized cases run as they did before this PR (no skip needed). The kernel backend is now opt-in via use_kernel=True and doesn't intercept existing e2e parametrizations.

@@ -0,0 +1,511 @@
"""``DatabricksClient`` backed by the Rust kernel via PyO3.

High severity — flagged by test, devils-advocate

auth_bridge, result_set, and type_mapping each get unit coverage; the largest, most behavior-rich file in the PR has none. Only the e2e file exercises it — and that file skips silently when creds OR the kernel wheel are missing (which is the CI default, see F8).

Concretely uncovered by any non-live test:

  • _CODE_TO_EXCEPTION mapping (14 entries) — trivially testable with a fake _kernel.KernelError; a typo on "Unauthenticated" collapses silently to DatabaseError.
  • _reraise_kernel_error attribute forwarding — copies 7 structured fields (code, sql_state, error_code, vendor_code, http_status, retryable, query_id). E2E only verifies code + sql_state.
  • open_session double-open guardraise InterfaceError(...) not exercised.
  • All 5 InterfaceError "no open session" guardsexecute_command, get_catalogs, get_schemas, get_tables, get_columns.
  • get_columns catalog_name required check — e2e always passes a catalog.
  • execute_command parameters / query_tags NotSupportedError — regression risk that accepting parameters would silently dispatch to a kernel Statement with no bind_param.
  • cancel_command / close_command no-handle tolerant path.
  • close_session cleanup of _async_handles including swallow-on-KernelError.
  • get_query_state sync-path SUCCEEDED shortcut and Failed-state re-raise.
  • _STATE_TO_COMMAND_STATE mapping (6 entries).
  • max_download_threads property.
  • get_tables table_types warning behavior.

All achievable with a MagicMock _kernel module via monkeypatch.setattr. The auth-bridge and result-set tests demonstrate the pattern. The absence of tests/unit/test_kernel_client.py is the single biggest test-quality issue in this PR.

Recommend: add tests/unit/test_kernel_client.py — ~150 LOC covers the items above.
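The fake-module pattern suggested here can be sketched as follows; the databricks_sql_kernel surface (a KernelError carrying code / sql_state) is assumed from this thread, not taken from the real extension:

```python
import sys
import types

# Build a stand-in databricks_sql_kernel module so unit tests can import it
# without the Rust extension installed. Attribute names are illustrative.
fake = types.ModuleType("databricks_sql_kernel")

class KernelError(Exception):
    """Minimal fake of the kernel's structured error type."""
    def __init__(self, message, code="Unknown", sql_state=None):
        super().__init__(message)
        self.code = code
        self.sql_state = sql_state

fake.KernelError = KernelError
sys.modules["databricks_sql_kernel"] = fake

# Any later `import databricks_sql_kernel` now resolves to the fake, so
# code under test (and its lazy import) works on machines without the wheel.
import databricks_sql_kernel
try:
    raise databricks_sql_kernel.KernelError("denied", code="Unauthenticated")
except databricks_sql_kernel.KernelError as exc:
    assert exc.code == "Unauthenticated"
```

In pytest this would typically live in a fixture using monkeypatch.setitem(sys.modules, ...) so the fake is removed after each test.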


Addressed in 24e9a5c. Added tests/unit/test_kernel_client.py with 38 cases covering: full _CODE_TO_EXCEPTION (14-entry parametrize), _reraise_kernel_error attribute forwarding, full _STATE_TO_COMMAND_STATE (6-entry parametrize), all 5 no-open-session guards, open_session double-open, parameters and query_tags rejection, get_columns catalog-required, cancel_command / close_command tolerance, get_query_state sync-path SUCCEEDED and Failed-state re-raise, synthetic CommandId UUID shape, and close_session cleanup-on-failure. Uses a fake databricks_sql_kernel module installed into sys.modules so it runs without the Rust extension. 77/77 kernel unit tests pass locally.

federated.add_headers = base.add_headers
kwargs = kernel_auth_kwargs(federated)
assert kwargs == {"auth_type": "pat", "access_token": "dapi-abc"}


Medium severity — flagged by language, maintainability, devils-advocate, test, security

federated = TokenFederationProvider.__new__(TokenFederationProvider)
federated.external_provider = base
federated.add_headers = base.add_headers

This bypasses __init__ (which requires http_client and normalizes hostname) AND monkey-patches over the real TokenFederationProvider.add_headers — the one containing all the token-exchange logic. The assertion kwargs == {"auth_type": "pat", "access_token": "dapi-abc"} therefore says nothing about whether the bridge handles the real federation flow. It passes "by accident" because a dapi-… token isn't a JWT, so the real path's _should_exchange_token happens to return False.

Any future change to TokenFederationProvider.add_headers (added telemetry, new refresh trigger, eager exchange) silently breaks the bridge while this test stays green. With a real-init federated provider, add_headers writes the federation-exchanged token (not the original PAT) — the bridge would extract that token, not the underlying PAT. The test name test_federation_wrapped_pat_routes_to_kernel_pat overstates what's verified.

tests/unit/test_token_federation.py:31-36 already demonstrates clean construction with a MagicMock http_client.

Recommend:

federated = TokenFederationProvider(
    hostname="https://example.cloud.databricks.com",
    external_provider=base,
    http_client=Mock(),
)


Fixed in 37fa544. The test now constructs a real TokenFederationProvider(http_client=Mock()) and exercises its actual add_headers path; for a plain dapi-… PAT _should_exchange_token returns False (not a JWT) so no exchange fires and the mock http_client is never invoked.

def get_catalogs(
self,
session_id: SessionId,
max_rows: int,

Medium severity — flagged by architecture, devils-advocate, ops

_async_handles is a single dict per KernelDatabricksClient. Multiple cursors on the same connection share it via self.backend. Mutations happen in execute_command (insert), close_command (pop), cancel_command (get), close_session (iterate-then-clear) — all unlocked.

Two threads issuing async statements concurrently are safe in CPython by GIL accident but not by design. Worse, close_session does for handle in list(self._async_handles.values()): ...; self._async_handles.clear() — a thread mid-execute_command(async_op=True) could add a new handle after the iterator copy is taken but before clear(). That handle is dropped on the floor with no .close() called — kernel-side state leaks.

The connector explicitly documents thread-safety per cursor; this regresses below that bar for shared async tracking.

Recommend: wrap mutations in a threading.RLock, or document non-thread-safety in the class docstring with an explicit warning.


Fixed in 24e9a5c. Added self._async_handles_lock = threading.RLock() and wrapped every read/mutation site (execute_command insert, cancel_command / close_command / get_query_state / get_execution_result reads, close_session iterate+clear). The close_session pattern is now snapshot-under-lock then close-outside-lock — newly-added handles after the snapshot stay in the dict for the next sweep instead of being dropped on the floor.
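The snapshot-under-lock / close-outside-lock sweep described in this fix can be sketched as (illustrative names, not the PR's actual code):

```python
import threading

class AsyncHandleTracker:
    """Sketch of the fixed close_session sweep: snapshot the handle map
    under the lock, close handles outside it, then remove only the ones
    actually closed. Handles registered mid-sweep survive for the next
    sweep instead of being dropped without a close()."""
    def __init__(self):
        self._async_handles = {}
        self._async_handles_lock = threading.RLock()

    def register(self, guid, handle):
        with self._async_handles_lock:
            self._async_handles[guid] = handle

    def sweep(self):
        with self._async_handles_lock:
            snapshot = list(self._async_handles.items())
        closed = []
        for guid, handle in snapshot:
            handle.close()          # potentially slow; lock not held here
            closed.append(guid)
        with self._async_handles_lock:
            for guid in closed:
                self._async_handles.pop(guid, None)
        return closed

class _Handle:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

t = AsyncHandleTracker()
t.register("a", _Handle())
t.register("b", _Handle())
closed = t.sweep()
assert sorted(closed) == ["a", "b"]
t.register("c", _Handle())          # late arrival: kept for the next sweep
assert "c" in t._async_handles
```

Popping only the closed guids (rather than a blanket clear()) is what prevents the leak: a handle inserted between snapshot and clear is never silently discarded.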

None,
)
for field in schema
]

Medium severity — flagged by agent-compat, maintainability

return str(arrow_type) for unrecognized types produces strings like "fixed_size_binary[16]" or "timestamp[ns, tz=UTC]" in cursor.description[i][1]. Code (and LLM agents) branching on description type strings — a common pattern, e.g., if col_type == "timestamp": — silently miss these cases.

The unit test only verifies pa.null() falls through to "null"; the visually ugly cases aren't covered.

pa.timestamp("us") and pa.timestamp("ns", tz="UTC") both pass is_timestamp and map to "timestamp" (fine), but other parametrized types (fixed-size, dictionary-encoded, union) fall to str(arrow_type).

Recommend: either (a) document the fallback shape in the module docstring so callers know to handle parameterized type strings, (b) lowercase + strip parameters before returning, or (c) extend the explicit list to cover the parametrized variants the kernel actually returns.
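Option (b) could look like the following hypothetical helper (not code from the PR; note that pyarrow renders some parameterized types with square brackets and others with parentheses):

```python
import re

def normalize_type_name(arrow_type_str: str) -> str:
    """Hypothetical option-(b) normalizer: lowercase and strip the trailing
    parameter list from an Arrow type's str() form. pyarrow uses square
    brackets for some types ("timestamp[ns, tz=UTC]") and parentheses for
    others ("decimal128(10, 2)"), so strip either."""
    return re.sub(r"(\[.*\]|\(.*\))$", "", arrow_type_str).lower()

assert normalize_type_name("timestamp[ns, tz=UTC]") == "timestamp"
assert normalize_type_name("fixed_size_binary[16]") == "fixed_size_binary"
assert normalize_type_name("decimal128(10, 2)") == "decimal128"
assert normalize_type_name("null") == "null"
```

This is deliberately lossy — the tz and precision are dropped — which is the trade-off the thread weighs against extending the explicit type table instead.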


Deferred. Documented as a follow-up — the kernel will eventually return a richer Arrow type surface, and the right shape is to expand the explicit table when kernel adds parameterized types we care about, not to lossily lowercase/strip on the connector side.

return
try:
handle.close()
except _kernel.KernelError as exc:

Medium severity — flagged by language

Signature is (exc: BaseException) -> "Error" with # type: ignore[return-value] on the non-KernelError passthrough. The ignore is masking a real type issue: this function's contract is "either re-raise as Error or pass through."

Every caller in the file does raise _reraise_kernel_error(exc) after an isinstance(exc, KernelError) check (12 callers), so the non-KernelError passthrough is unreachable in practice.

Recommend: delete the dead branch and tighten the signature to (exc: _kernel.KernelError) -> Error (called only after an isinstance check). Or, if the passthrough is intentional, change return-type to Union[BaseException, Error] and drop the ignore. Deleting is simpler.
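A sketch of the tightened mapper, with illustrative exception classes standing in for the connector's PEP 249 hierarchy and a one-entry stand-in for _CODE_TO_EXCEPTION (all names assumed from this thread):

```python
class Error(Exception): ...
class DatabaseError(Error): ...
class OperationalError(DatabaseError): ...

class KernelError(Exception):
    """Fake of the kernel's structured error for illustration."""
    def __init__(self, message, code="Unknown", sql_state=None):
        super().__init__(message)
        self.code = code
        self.sql_state = sql_state

_CODE_TO_EXCEPTION = {"Unavailable": OperationalError}  # illustrative subset

def reraise_kernel_error(exc: KernelError) -> Error:
    """Tightened contract: only ever called with a KernelError, always
    returns an Error for the caller to raise. No passthrough branch,
    no type: ignore needed."""
    new = _CODE_TO_EXCEPTION.get(exc.code, DatabaseError)(str(exc))
    for attr in ("code", "sql_state"):
        # Plain setattr: PEP 249 exception classes don't use __slots__.
        setattr(new, attr, getattr(exc, attr, None))
    return new

try:
    raise reraise_kernel_error(KernelError("gone", code="Unavailable", sql_state="08000"))
except OperationalError as mapped:
    assert mapped.sql_state == "08000"
```

Unknown codes fall through to DatabaseError, which is why the thread flags a typo in the mapping table ("Unauthenticated") as a silent collapse rather than an error.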


Fixed in 37fa544. Tightened the signature to (exc: _kernel.KernelError) -> Error, dropped the unreachable passthrough branch and the # type: ignore[return-value], and replaced the defensive setattr try/except with a plain setattr(new, attr, getattr(exc, attr, None)) since none of the PEP 249 exception classes use __slots__.

buffer_size_bytes=cursor.buffer_size_bytes,
)

# ── Metadata ───────────────────────────────────────────────────

Medium severity — flagged by maintainability, language

_use_arrow_native_complex_types: Optional[bool] = True is accepted in __init__ but never read. Not passed by session.py::_create_backend either. Drop entirely.

Also: Optional[bool] = True is Union[bool, None] defaulted to True — the Optional is misleading here, since None carries no distinct "use default" meaning alongside True. If the parameter is kept at all, restrict it to bool = True.


Fixed in 37fa544. Removed _use_arrow_native_complex_types from KernelDatabricksClient.__init__ — it was accepted but never read, and session.py::_create_backend doesn't pass it for the kernel branch.

batch size; ``fetchall`` drains the whole stream.
"""

from __future__ import annotations

Medium severity — flagged by test, language

The class docstring claims the duck-typed kernel_handle must implement arrow_schema() / fetch_next_batch() / fetch_all_arrow() / close(). The production code never calls fetch_all_arrowKernelResultSet streams via fetch_next_batch() + a custom _drain. The _FakeKernelHandle test double also doesn't implement fetch_all_arrow.

If the kernel adds a required method (e.g., fetch_next_batch(timeout=...) becomes mandatory), unit tests still pass and the regression lands in e2e — which is silently skipped per F8.

Recommend:

  • Drop fetch_all_arrow from the docstring contract, OR
  • Add a contract test that (when databricks_sql_kernel is importable) asserts the duck-typed methods are actually exposed.


Fixed in 37fa544. Dropped fetch_all_arrow from the kernel-handle contract in the class docstring — the documented duck-typed surface is now arrow_schema() / fetch_next_batch() / close(), matching what the production streaming path and the _FakeKernelHandle test double actually implement.

arraysize: int,
buffer_size_bytes: int,
):
schema = kernel_handle.arrow_schema()

Low severity — flagged by performance

buffer_size_bytes is accepted by the constructor and forwarded to the base ResultSet, but never consulted by the kernel backend. The kernel currently caps buffer by rows-pulled, not bytes.

Recommend: document the no-op (a comment in the class docstring or constructor) so callers tuning buffer_size_bytes for memory ceilings on Thrift know it doesn't apply on use_kernel=True.


Fixed in 37fa544. Added a paragraph to the KernelResultSet class docstring explicitly documenting that buffer_size_bytes is accepted for base-class contract compatibility but is not consulted — kernel currently caps by rows pulled, not bytes.

lz4_compressed=False,
arrow_schema_bytes=None,
)
self._kernel_handle = kernel_handle

Low severity — flagged by maintainability

The base ResultSet expects a results_queue with a next_n_rows / remaining_rows / close interface — both ThriftResultSet and SeaResultSet use it. KernelResultSet passes results_queue=None and duplicates _pull_one_batch / _ensure_buffered / _take_buffered / _drain (~80 lines) plus its own fetch overrides.

The docstring even acknowledges the duplication: "Buffer shape mirrors the prior ADBC POC's AdbcResultSet."

Recommend (follow-up): extract BufferedArrowQueue(results_queue) wrapping any handle implementing arrow_schema() / fetch_next_batch() / close(). Both KernelResultSet and any future AdbcResultSet become 15-line constructor-only subclasses, and the base ResultSet.fetchmany_arrow / fetchall_arrow works unchanged.


Deferred — this is a cross-cutting refactor that would touch both kernel and SEA backends. Tracked as a follow-up; appropriate to land alongside the ADBC POC if/when it gets revived.

buffer_size_bytes=cursor.buffer_size_bytes,
)

def _synthetic_command_id(self) -> CommandId:

Low severity — flagged by security

self._auth_kwargs = {"auth_type": "pat", "access_token": <raw_token>} is stored on the KernelDatabricksClient instance for its life. Thrift / native-SEA materialize the token only inside per-request add_headers calls. The kernel client elevates the cleartext token to a long-lived attribute on a connector object — at risk of accidental pickling, debugger dumps, or telemetry capture.

Recommend: clear self._auth_kwargs (or just self._auth_kwargs["access_token"]) immediately after _kernel.Session(...) returns in open_session. Or move it into a closure rather than an instance attribute.


Fixed in 37fa544. Added a finally block to open_session that pops access_token from self._auth_kwargs after the kernel Session is constructed (or failed). Kernel owns the credential from then on; no cleartext copy stays on the long-lived connector object.
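The finally-pop pattern described in this fix can be sketched as follows (a stand-in _make_session callable replaces _kernel.Session; names are illustrative):

```python
class SessionSketch:
    """Credential hygiene from the fix: pop the cleartext token from the
    long-lived kwargs dict whether session construction succeeds or raises."""
    def __init__(self, auth_kwargs, make_session):
        self._auth_kwargs = dict(auth_kwargs)
        self._make_session = make_session   # stand-in for _kernel.Session

    def open_session(self):
        try:
            self._session = self._make_session(**self._auth_kwargs)
        finally:
            # Kernel owns the credential now; drop the connector-side copy.
            self._auth_kwargs.pop("access_token", None)

# Success path: token is gone after open_session.
ok = SessionSketch({"auth_type": "pat", "access_token": "dapi-secret"},
                   make_session=lambda **kw: object())
ok.open_session()
assert "access_token" not in ok._auth_kwargs

# Failure path: the finally block still scrubs the token.
def _boom(**kwargs):
    raise RuntimeError("construction failed")

failing = SessionSketch({"auth_type": "pat", "access_token": "dapi-secret"},
                        make_session=_boom)
try:
    failing.open_session()
except RuntimeError:
    pass
assert "access_token" not in failing._auth_kwargs
```

The failure path matters: without the finally, a construction error would leave the cleartext token sitting on the connector object indefinitely.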

kernel side."""
with conn.cursor() as cur:
cur.execute("SELECT * FROM range(10000)")
rows = cur.fetchall()

Low severity — flagged by test

Docstring says SELECT * FROM range(10000) "exercises the CloudFetch / multi-batch path on the kernel side". 10000 BIGINT rows is ~80 KB — almost certainly a single inline chunk on a typical warehouse. Existing CloudFetch-aimed tests (tests/e2e/test_driver.py:145-180) use 100 MB / 12.5M rows.

The comment overstates scope. A future reader may believe CloudFetch is covered when it isn't.

Recommend: drop the misleading claim, or scale to range(2_000_000) to actually cross a chunk boundary.


Fixed in 37fa544. Replaced the misleading 'exercises CloudFetch / multi-batch path' claim with a note that it covers end-of-stream drain over multiple fetch_next_batch calls and isn't large enough for CloudFetch — pointing to test_driver for CloudFetch coverage.

Comment thread tests/unit/test_kernel_auth_bridge.py Outdated

from __future__ import annotations

from unittest.mock import MagicMock

Low severity — flagged by test

from unittest.mock import MagicMock is imported but never used in this file. Dead import.


Fixed in 37fa544. Replaced MagicMock import with Mock since the federation test now needs the latter; no longer unused.

@vikrantpuppala

Code Review Squad — Failed Inline Comments

Could not post inline comments for: F2, F4, F7, F10, F11, F13, F14, F19, F22, F25 — see body below.


F2 — Federated-PAT token refresh is dead — single snapshot at construct time

High severity — flagged by security + devils-advocate

kernel_auth_kwargs extracts the bearer token once via auth_provider.add_headers({}) at construct time. The result is stored on KernelDatabricksClient._auth_kwargs and spread into _kernel.Session(...) at open_session. After that, the kernel uses that frozen token for the session lifetime.

get_python_sql_connector_auth_provider wraps every provider in TokenFederationProvider. For a PAT whose issuer host differs from the workspace host, TokenFederationProvider._get_token (src/databricks/sql/auth/token_federation.py:131-153) performs an OAuth token-exchange and caches the exchanged token, refreshing it on subsequent add_headers calls when it expires.

The bridge captures only the first exchanged token and never re-extracts. Long-running kernel sessions outlive the exchanged token's TTL and start failing Unauthenticated mid-session — even though the connector-side TokenFederationProvider would happily mint a fresh one.

The bridge's own docstring acknowledges this failure mode while justifying OAuth rejection: "routing OAuth through PAT would silently break token refresh during long-running sessions." That exact failure mode applies to the federated-PAT case the bridge accepts.

Recommend: either (a) propagate a refresh callback into the kernel Session, (b) block federation-wrapped PAT until the kernel exposes a refresh hook, or (c) refresh _auth_kwargs immediately before each kernel call (defeats the purpose of caching but is correct).
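Option (c) — re-extracting the token before each kernel call rather than snapshotting once — can be sketched like this (fake_add_headers stands in for the provider's real add_headers; everything here is illustrative):

```python
class RefreshingTokenSource:
    """Sketch of F2 option (c): pull a fresh bearer token from the auth
    provider's add_headers on every call, so a federation-exchanged token
    that was refreshed connector-side is what reaches the kernel."""
    def __init__(self, add_headers):
        self._add_headers = add_headers   # provider's add_headers(dict)

    def current_token(self):
        headers = {}
        self._add_headers(headers)        # provider may refresh internally
        return headers["Authorization"][len("Bearer "):]

# Fake provider that mints a new token per call, mimicking a refresh.
calls = []
def fake_add_headers(headers):
    calls.append(1)
    headers["Authorization"] = f"Bearer tok-{len(calls)}"

src = RefreshingTokenSource(fake_add_headers)
assert src.current_token() == "tok-1"
assert src.current_token() == "tok-2"   # fresh extraction, not a snapshot
```

As the finding notes, this trades away caching for correctness; a kernel-side refresh callback (option a) would avoid the per-call extraction entirely.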


F4 — get_tables with table_types silently returns wrong-answer result

High severity — flagged by devils-advocate, agent-compat, ops, security

The warning says "client-side table_types filter not yet implemented … returning unfiltered rows for %r" — and then the code passes table_types=table_types to _kernel_session.metadata().list_tables(...) anyway on the next line.

Either the kernel filters server-side (in which case the warning is wrong and misleading — callers think they're getting unfiltered rows when they aren't) or it doesn't (the kwarg passthrough is dead). Both states are bad:

  • BI tools / schema browsers call getTables(table_types=["TABLE", "VIEW"]) and iterate. If the result is unfiltered, they silently see other table types mixed in — producing wrong UI / wrong downstream answers.
  • Access-boundary callers using table_types as a filter get either a silent wrong answer or an undocumented divergence from the connector contract.

logger.warning is also invisible to programmatic consumers — it's not actionable.

Recommend: until the filter is correctly implemented, raise NotSupportedError when table_types is non-empty. Same pattern as parameters and query_tags rejection elsewhere in this client. A logger.warning plus silent wrong-answer is the worst of both worlds.
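The recommended guard is small; a sketch with an illustrative NotSupportedError stand-in (the real connector class lives in its PEP 249 hierarchy):

```python
class NotSupportedError(Exception): ...

def check_table_types(table_types):
    """Sketch of the F4 recommendation: fail loudly while the table_types
    filter is unimplemented, instead of warning and returning unfiltered
    rows that callers will treat as filtered."""
    if table_types:
        raise NotSupportedError(
            "table_types filtering is not yet supported on the kernel backend"
        )

check_table_types(None)   # no filter requested: fine
check_table_types([])     # empty filter: fine
try:
    check_table_types(["TABLE", "VIEW"])
except NotSupportedError:
    pass
else:
    raise AssertionError("non-empty table_types should raise")
```

A raised exception is visible to programmatic consumers (BI tools, agents) in a way a logger.warning never is — the same reasoning behind the client's existing parameters / query_tags rejection.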


F7 — get_tables warning is log spam at metadata-call frequency

High severity — flagged by ops, agent-compat, architecture

logger.warning fires on every get_tables call with a non-empty table_types. Metadata calls are made by every BI tool that browses a workspace: Power BI, Tableau, dbt, DataGrip, etc. — all call getTables(table_types=["TABLE", "VIEW"]) on connection open and every catalog-tree refresh. At Databricks customer scale this is easily 100s of QPS per workspace.

WARN-level log spam at that rate triggers alert fatigue for any operator with a "WARN-or-above" alarm and dominates the kernel-backend portion of stdout/stderr at scale.

Recommend: demote to logger.debug, or move to a once-per-session lazy-init flag (operator sees it once during the session, not on every call). Even logger.info is borderline at metadata-call frequency.
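The once-per-session variant can be sketched with a tiny flag wrapper (illustrative names; a list-backed fake logger stands in for the logging module so the behavior is observable):

```python
class WarnOnce:
    """Sketch of the F7 once-per-session option: WARN on the first
    unsupported-filter call, DEBUG thereafter."""
    def __init__(self, logger):
        self._logger = logger
        self._warned = False

    def note_unfiltered(self, table_types):
        if not self._warned:
            self._logger.warning("table_types filter not implemented: %r", table_types)
            self._warned = True
        else:
            self._logger.debug("table_types filter not implemented: %r", table_types)

class _ListLogger:
    """Fake logger capturing messages per level for the demo."""
    def __init__(self):
        self.warnings = []
        self.debugs = []
    def warning(self, msg, *args):
        self.warnings.append(msg % args)
    def debug(self, msg, *args):
        self.debugs.append(msg % args)

log = _ListLogger()
w = WarnOnce(log)
for _ in range(3):
    w.note_unfiltered(["TABLE", "VIEW"])
assert len(log.warnings) == 1   # one WARN per session
assert len(log.debugs) == 2    # repeats demoted to DEBUG
```

At metadata-call frequency this keeps the operator signal (one WARN per session) without the alert-fatigue cost of WARN-per-call.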


F10 — Telemetry is structurally degraded for kernel-backed cursors

Medium severity — flagged by ops, architecture

Three places where the per-statement / per-chunk telemetry that downstream dashboards depend on is silently broken for use_kernel=True:

  1. _extract_cursor_data:54 reads cursor.active_result_set.results and isinstance-checks against ColumnQueue / CloudFetchQueue / ArrowQueue. KernelResultSet.__init__ passes results_queue=Noneself.results is None on every kernel result set. Every kernel-backed query will telemetry-report ExecutionResultFormat.FORMAT_UNSPECIFIED — a silent dashboard-level regression for anyone slicing latency by execution format.
  2. _extract_cursor_data:68 reads cursor.backend.retry_policy for retry_count. KernelDatabricksClient has no retry_policy attribute — kernel manages retries internally. Every kernel-backed cursor will report retry_count=0. Retry-count anomalies are a leading signal for partial outages.
  3. CloudFetch chunk-level telemetry (@log_latency on ResultSetDownloadHandler) only fires when the connector's own CloudFetch downloader runs. The kernel does its own HTTP fetching of cloud URLs — so chunk-level latency / retry / decompression telemetry is entirely absent.

Plus: connection-level DriverConnectionParameters still reports DatabricksClientType.SEA for kernel sessions (src/databricks/sql/client.py:379-381).

Recommend: either (a) plumb a telemetry hook from the kernel through KernelResultSet (kernel exposes chunk-fetch / retry-count summaries that the connector forwards into SqlExecutionEvent), or (b) document this as a known-limitation banner in the PR description and acknowledge dashboards keyed on these fields under-report for use_kernel=True.


F11 — cancel_command is a no-op on the sync-execute path; get_query_state reports SUCCEEDED

Medium severity — flagged by ops, devils-advocate, agent-compat

Sync executes never populate _async_handles. So calling cancel_command for a still-draining sync cursor logs at DEBUG and returns. Combined with get_query_state (client.py:543-549) returning CommandState.SUCCEEDED for unknown command_ids, a cursor.cancel() followed by get_query_state() falsely reports success — the query continues running server-side.

The inline comment claims this matches Thrift's tolerant behavior — but Thrift's tolerance is for race-y double-cancels, not "cancel was called for a command we never tracked."

Recommend: either (a) for sync-execute results, store the executed handle in a separate _sync_handles map keyed on command_id so cancel_command can call handle.cancel() on it, or (b) explicitly document that cursor.cancel() after execute() returned (i.e., during result drain) is a no-op on use_kernel=True. Today it's both silently broken and not documented.


F13 — get_columns(catalog_name=None) raises ProgrammingError; Thrift accepts None — silent semantic drift

Medium severity — flagged by devils-advocate, agent-compat, security, architecture

KernelDatabricksClient.get_columns raises ProgrammingError("get_columns requires catalog_name on the kernel backend.") when catalog_name is None. The Thrift backend (src/databricks/sql/backend/thrift_backend.py:1217-1244) accepts catalog_name=None and forwards it to the server. Cursor.columns() (src/databricks/sql/client.py:1567-1593) keeps catalog_name: Optional[str] = None in its public signature.

Cross-backend code (third-party tooling doing schema discovery) that worked on the Thrift backend will start raising on the same call under use_kernel=True. The error message is actionable, but the divergence isn't documented in the Cursor.columns() docstring.

Recommend: either (a) narrow the Cursor.columns() signature for kernel mode, (b) fall back to a per-catalog scan inside the kernel client, or (c) document the divergence on Cursor.columns() so users know to branch on backend.


F14 — Synthetic metadata-{uuid} CommandIds pollute telemetry and cursor.query_id

Medium severity — flagged by architecture, maintainability, devils-advocate

_synthetic_command_id returns CommandId.from_sea_statement_id(f"metadata-{uuid.uuid4()}") — i.e., a CommandId(backend_type=SEA, guid="metadata-..."). Followed the consumers:

  • Cursor.query_id returns self.active_command_id.to_hex_guid() which for SEA-typed CommandId returns the literal string "metadata-9f3d…".
  • Telemetry log builders (src/databricks/sql/telemetry/) record query_id verbatim. Synthetic IDs pollute the telemetry stream.
  • Anything downstream that does int(statement_id, 16) or uuid.UUID(statement_id) would choke. No such consumer in src/ today, but downstream observability pipelines that ingest query_id may.

The synthetic ID also misrepresents the backend (BackendType.SEA when this is the kernel).

Recommend: either (a) make the synthetic ID look like a UUID (uuid.uuid4().hex with no metadata- prefix — lose human-readability for telemetry safety), (b) add a CommandId.for_kernel_metadata() factory in backend/types.py so the synthetic-ness is explicit, or (c) don't set active_command_id for metadata calls (let query_id stay None as it does for any cursor before execute).
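Option (a) is a one-liner; a sketch with an illustrative helper name:

```python
import uuid

def synthetic_statement_id() -> str:
    """Option (a) from F14: a UUID-shaped synthetic statement id with no
    "metadata-" prefix, so downstream consumers doing uuid.UUID(query_id)
    parse it cleanly and telemetry ingestion isn't polluted."""
    return uuid.uuid4().hex

sid = synthetic_statement_id()
assert uuid.UUID(sid)                    # parses as a UUID
assert not sid.startswith("metadata-")   # no human-readable prefix leaks out
```

The trade-off the finding names is real: this drops the human-readable hint that the id is synthetic, which is why option (b)'s explicit CommandId.for_kernel_metadata() factory may be the better long-term shape.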


F19 — Three near-identical return KernelResultSet(...) blocks should call _metadata_result

Medium severity — flagged by maintainability

client.py:510-517 (sync execute), client.py:577-584 (get_execution_result), and client.py:588-596 (metadata) all build KernelResultSet with the same 6 args. The metadata path is already factored out as _metadata_result(stream, cursor, command_id). The other two should call it too — saves 12 lines and removes a divergence risk if a new kwarg is added to KernelResultSet.


F22 — auth_bridge._is_pat duplicates TokenFederationProvider-peek logic from telemetry

Low severity — flagged by maintainability

_is_pat peeks through TokenFederationProvider to find a wrapped AccessTokenAuthProvider. The same pattern lives at src/databricks/sql/telemetry/telemetry_client.py:94-101:

if isinstance(auth_provider, TokenFederationProvider):
    return TelemetryHelper.get_auth_mechanism(auth_provider.external_provider)
if isinstance(auth_provider, AccessTokenAuthProvider):
    return AuthMech.PAT

Recent commit cbd6a883 ("Telemetry: unwrap TokenFederationProvider…") shows the team explicitly cares about this surface.

Recommend: extract auth/utils.py::unwrap_to_pat_provider(auth_provider) -> Optional[AccessTokenAuthProvider] shared by both call sites. Keeps them in lock-step when OAuth-through-kernel ships. Not a blocker.


F25 — No control-char / CRLF sanitization on extracted bearer token

Low severity — flagged by security

_extract_bearer_token returns auth[len("Bearer "):] verbatim. AccessTokenAuthProvider.__init__ doesn't validate either. User-supplied access_token= strings containing \r\n, \0, or other control chars flow straight through to the kernel. If the kernel layer ever places this back into an HTTP header without scrubbing, it's a header-injection sink.

Recommend: defense-in-depth — reject tokens matching [\x00-\x1f\x7f] at the connector boundary. Cheap.


Summary: 17 of 27 findings posted as inline comments. 10 failed due to line-number validation (not in diff context). All critical/high/medium findings are covered — either by inline comment or in this summary.

Cleanup pass on the kernel-backend PR addressing reviewer feedback
that doesn't change observable behaviour:

- result_set.py: replace O(M²) `_buffered_rows` with running counter
  `_buffered_count` maintained by pull/take/drain (perf F6).
- result_set.py: docstring corrections — drop nonexistent
  `fetch_all_arrow` from kernel-handle contract (F20); document
  `buffer_size_bytes` as no-op on the kernel backend (F21).
- client.py: tighten `_reraise_kernel_error` signature to
  `_kernel.KernelError` only; drop dead passthrough branch and the
  defensive setattr try/except (F17).
- client.py: drop unused `_use_arrow_native_complex_types` kwarg (F18).
- client.py: collapse three `KernelResultSet(...)` construction sites
  through `_make_result_set` (renamed from `_metadata_result`) (F19).
- client.py: drop `metadata-` prefix from synthetic CommandId; use a
  plain `uuid.uuid4().hex` so anything reading `cursor.query_id`
  downstream sees a UUID-shaped string (F14).
- client.py: clear the raw access token from `_auth_kwargs` after the
  kernel session is constructed — kernel owns the credential from
  then on, no need to retain a cleartext copy on the connector
  instance (F24).
- auth_bridge.py: reject bearer tokens containing ASCII control
  characters at extraction time (defense-in-depth against header
  injection if a misbehaving HTTP stack ever places the token back
  into a header without scrubbing) (F25).
- tests/unit/test_kernel_auth_bridge.py: construct a real
  `TokenFederationProvider(http_client=Mock())` instead of bypassing
  `__init__` with `__new__` + monkey-patching `add_headers`. Exercises
  the real federation passthrough path the bridge sees in production
  (F12). Drop unused `MagicMock` import (F27).
- tests/e2e/test_kernel_backend.py: drop misleading CloudFetch claim
  on `test_drain_large_range_to_arrow` — 10000 BIGINT rows is ~80 KB,
  single inline chunk on a typical warehouse (F26).

All 39 existing kernel unit tests pass.

Co-authored-by: Isaac
…ve review fixes

Major change: route the kernel backend through a new ``use_kernel=True``
connection kwarg instead of repurposing ``use_sea=True``. ``use_sea=True``
once again routes to the native pure-Python SEA backend (no behaviour
change); ``use_kernel=True`` routes to the Rust kernel via PyO3. The
two flags are mutually exclusive.

This addresses the largest reviewer concern from the multi-agent
review: silently hijacking a documented public flag broke OAuth /
federation / parameter-binding callers on ``use_sea=True`` who had no
opt-out. With the new flag, the kernel backend is fully opt-in and
existing ``use_sea=True`` users continue to get the native SEA backend
they signed up for.

Other substantive fixes:

- session.py: restore ``SeaDatabricksClient`` import + routing. Reject
  ``use_kernel=True`` + ``use_sea=True`` together with a clear
  ``ValueError``.
- client.py (kernel ``Cursor.columns``): update docstring to flag the
  ``catalog_name=None`` divergence — kernel requires a catalog,
  Thrift / native SEA do not (F13).
- conftest.py: drop the collection-time ``pytest_collection_modifyitems``
  hook that was skipping ``extra_params={"use_sea": True}`` cases. With
  ``use_sea=True`` back on the native SEA backend, those cases run as
  they did before this PR (F8).
- kernel/client.py: ``get_tables`` now applies the ``table_types``
  filter client-side using ``ResultSetFilter._filter_arrow_table``
  (the same helper the native SEA backend uses), wrapped in a tiny
  ``_StaticArrowHandle`` that flows the filtered table back through
  the normal ``KernelResultSet`` path. Replaces the previous
  "log a warning and return unfiltered" behaviour (F4).
- kernel/client.py: guard ``_async_handles`` with ``threading.RLock``
  so concurrent cursors on the same connection don't race on
  submit / close / close-session (F15).
- kernel/result_set.py: ``KernelResultSet.close()`` now drops the
  entry from ``backend._async_handles`` so async-submitted statements
  don't leave stale references behind (F5).
- kernel/{__init__,client,auth_bridge}.py, tests/e2e/test_kernel_backend.py:
  update docstrings, error messages, and the e2e fixture to refer to
  ``use_kernel=True`` instead of ``use_sea=True``.
- client.py (``Connection`` docstring): document the new
  ``use_kernel`` kwarg + its Phase-1 limitations.

New tests:

- tests/unit/test_kernel_client.py (38 cases): cover the 14-entry
  ``_CODE_TO_EXCEPTION`` table, ``_reraise_kernel_error`` attribute
  forwarding, the 6-entry ``_STATE_TO_COMMAND_STATE`` table, the
  no-open-session guards on every method, ``open_session`` double-open,
  ``parameters`` / ``query_tags`` rejection, ``get_columns``'
  catalog-required check, ``cancel_command`` / ``close_command``
  no-handle tolerance, ``get_query_state`` sync-path SUCCEEDED, the
  Failed-state re-raise, the synthetic-command-id UUID shape, and
  ``close_session`` cleanup even when per-handle close errors fire.
  Uses a fake ``databricks_sql_kernel`` module installed into
  ``sys.modules`` so the test runs with no Rust extension dependency
  (F9).

77/77 kernel unit tests pass.

Co-authored-by: Isaac
@vikrantpuppala vikrantpuppala changed the title feat(backend/kernel): route use_sea=True through the Rust kernel via PyO3 feat(backend/kernel): add use_kernel=True flag — route through the Rust kernel via PyO3 May 15, 2026
@vikrantpuppala
Contributor Author

Code-review responses — summary-only findings

Replies to the findings that landed in the summary comment (couldn't be posted as inline replies because the cited line was outside the diff hunk):

  • F2 — Federated-PAT token refresh dead. Acknowledged as a known gap in the updated PR description. The bridge currently snapshots the bearer at construct time; refresh has to come from the kernel side (no Python-side hook today). Switching to use_kernel=True (24e9a5c) makes this opt-in, so existing use_sea=True callers are unaffected.

  • F4 — table_types silent wrong-answer. Fixed in 24e9a5c. get_tables now applies a client-side filter using the SEA backend's ResultSetFilter._filter_arrow_table (column index 5 = TABLE_TYPE, case-sensitive), wrapped in a tiny _StaticArrowHandle that flows the filtered table back through the normal KernelResultSet path. Same semantics as the native SEA backend.

  • F7 — get_tables warning is log spam. Resolved as a side-effect of F4 — the warning is gone since the filter is now correctly applied.

  • F10 — Telemetry parity. Acknowledged as a known gap in the updated PR description. Per-statement execution_result / retry_count / chunk-level latency under-report for use_kernel=True. Will plumb kernel-side hooks as they appear.

  • F11 — cancel_command sync no-op. Documented; tracking as a follow-up. The kernel currently doesn't return cancelable handles from sync execute(); once it does, we'll track them in a _sync_handles map.

  • F13 — get_columns(catalog_name=None) divergence. Fixed in 24e9a5c — Cursor.columns() docstring now documents that catalog_name is required on use_kernel=True.

  • F14 — Synthetic metadata-{uuid} CommandId. Fixed in 37fa544 — now uses uuid.uuid4().hex (no prefix), so cursor.query_id stays parseable downstream.

  • F19 — Three redundant KernelResultSet(...) blocks. Fixed in 37fa544 — all three sites now route through _make_result_set (renamed from _metadata_result).

  • F22 — Auth-bridge / telemetry duplication. Deferred as a cross-cutting refactor — extracting an auth/utils.py::unwrap_to_pat_provider shared by both call sites would be its own follow-up.

  • F25 — Bearer token control-char sanitization. Fixed in 37fa544. _extract_bearer_token now rejects tokens matching [\x00-\x1f\x7f] with a clear ValueError at extraction time.
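The F4 filter semantics (column index 5 = TABLE_TYPE, exact case-sensitive match, empty filter passes everything through) can be illustrated with a pure-Python sketch. The actual fix filters a pyarrow table via ResultSetFilter._filter_arrow_table; this row-list version only mirrors the semantics:

```python
from typing import List, Optional, Sequence, Tuple

TABLE_TYPE_COLUMN = 5  # TABLE_TYPE position in the get_tables row schema

def filter_table_rows(
    rows: List[Tuple], table_types: Optional[Sequence[str]]
) -> List[Tuple]:
    """Client-side table_types filter with F4's semantics.

    Pure-Python illustration only; the real code filters the Arrow
    table with the same helper the native SEA backend uses.
    """
    if not table_types:          # no filter -> return everything
        return list(rows)
    allowed = set(table_types)   # exact, case-sensitive membership test
    return [row for row in rows if row[TABLE_TYPE_COLUMN] in allowed]
```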

77/77 kernel unit tests passing locally on the new branch tip 24e9a5c2.
